library(tidyverse)
library(plotly)
library(gridExtra)
library(reshape2)
library(DT)
data <- read.csv("data.csv", encoding = 'UTF-8') %>% filter(year <= 2020)
data <- data %>% 
  select(id, name, artists, year, release_date, everything())

Phrases like Data is the new _____, with the blank filled by gold/oil/currency/whatever you consider the most precious asset, are becoming quite prevalent in our conversations and news feeds. As per the Statista report released in May 2020, the world is expected to create, capture, copy, and consume 74 zettabytes (1 zettabyte = a trillion gigabytes) of data in 2021. This number is projected to almost double in just 3 years, by 2024.

Statista report on data creation/consumption


The Data Detective episode of the Cautionary Tales podcast by Tim Harford aptly mentions that every human being is moderately curious or moderately incurious, and that there is no such thing as an inherently curious person. Curiosity is a cultivated trait and can be infused or diffused by context and subject.

Being surrounded by mountains and mountains of data can ignite curiosity even in the most bored minds like mine. Despite being an avid, and not yet bored, listener of Indian classical and pop music, I have decided to explore the wilderness of Western music (for the lack of a better term) using Spotify data, in the hope of learning a few things along the way.

Data at perusal

I have obtained all Spotify tracks for the US market, ranging from 1920 to 2020 (100 years!), from a Kaggle dataset. Each row ideally represents a track, with a Spotify track id, artist’s name, track’s name, and a bunch of audio features associated with the track. A sample of the data from the year 2020 is shown below.

datatable(
  head(
    data %>% 
      filter(year==2020)
    ), 
  rownames = FALSE, 
  options = list(dom = 'tp',
                 pageLength = 5,
                 scrollX = TRUE)
  )

Before our curiosity nudges us to look for data related to our favorite artists, let’s first understand the data.

Data types

We have a few categorical fields, like id, name, artists, release_date, and key, that can take a finite number of values. We also have many numerical fields, mostly from the audio features, like acousticness, danceability, etc., that can take an infinite number of values within a defined range. Notice that we have two fields, mode and explicit, that are logical/binary in nature.
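A quick way to verify this split is to tabulate each column's storage type. The sketch below uses a hypothetical toy frame in place of the real data (column names mirror the dataset): character and integer columns behave as categorical, double columns as numerical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Spotify data (hypothetical values; column names
# mirror the real dataset)
toy <- tibble(id = c("a1", "b2"), name = c("Song A", "Song B"),
              key = c(0L, 7L), acousticness = c(0.92, 0.11), mode = c(1L, 0L))

# One row per field with its storage type
field_types <- toy %>%
  summarise(across(everything(), ~ class(.x)[1])) %>%
  pivot_longer(everything(), names_to = "field", values_to = "type")
print(field_types)
```

Running the same check against the full data frame makes it easy to spot fields, like key, that are stored as integers but are really categorical.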

Data definitions

The table below has the definitions of all the fields, including those of the various audio features.

data_dict <- tibble("field" = names(data),
                    "definition" = c("Track unique id",
                                     "Track name",
                                     "Artist name",
                                     "Year the track was released",
                                     "Date the track was released",
                                     "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
                                     "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
                                     "Duration of the song in milliseconds",
                                     "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
                                     "If the track contains explicit content",
                                     "Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
                                     "The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on. If no key was detected, the value is -1.",
                                     "Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
                                     "The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.",
                                     "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
                                     "The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.",
                                     "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
                                     "The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
                                     "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."))

datatable(
  data_dict,
  rownames = FALSE,
  options = list(dom = 'tp',
                 pageLength = 5, 
                 scrollX = TRUE),
  caption = "Sourced from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md"
  )

After performing a few validation checks on the ideal state of this data (one record per song), certain curation issues appeared. A track can be added to Spotify as a single and/or as part of an album, with an individual id for each occurrence. The same track can also be added multiple times with different release dates. As per Spotify’s documentation, this used to be the case more often in older years. With the implementation of “track linking” in more recent years, such duplication is less common in the data. However, we need to overcome this issue, otherwise we would double-count statistics, which could be misleading.

One way to tackle this issue is to keep the earliest-released entry for each track (a combination of an artist and a track name) when it appears multiple times. For example, the track Rain On Me (with Ariana Grande) has two entries in the data, with release dates of May 22, 2020 and May 29, 2020, from which we would choose the one released on May 22, 2020. By following this method, we avoid favoring the popularity measure (which is correlated with the recent number of plays) for tracks and artists that have releases across various years with varying popularity measures.

After cleaning the data, we have a total of 158,581 tracks, a reduction of ~14,000 duplicate records.

datatable(
  data %>%
    filter(name=="Rain On Me (with Ariana Grande)"),
  rownames = FALSE,
  options = list(dom = 'tp',
                 pageLength = 5)
  )
# Map Spotify's integer pitch classes to note names; reversed so that
# "11" and "10" are replaced before "1" in str_replace_all()
key_map <- rev(c("0" = "C", "1" = "C#", "2" = "D", "3" = "D#", "4" = "E",
                 "5" = "F", "6" = "F#", "7" = "G", "8" = "G#", "9" = "A",
                 "10" = "A#", "11" = "B"))

data_prc <- data %>%
  mutate(id = as.character(id)) %>%
  #select(-id, -release_date) %>% # duplicate records for different ids and release date
  distinct() %>%
  # the same song by the same artist can have multiple records with slight
  # changes in audio features and year; keep the earliest release per track
  arrange(name, artists, year, release_date) %>% 
  group_by(name, artists) %>%
  slice(1) %>%
  #filter(popularity == max(popularity)) %>%
  ungroup() %>%
   mutate(popularity_category = ifelse(popularity >= 80, "80+", "<80"),
          valence_bin = cut(valence, seq(0,1,0.1), right = FALSE),
          duration_min = duration_ms/(1000*60),
          mode_type = case_when(mode==0 ~ "minor",
                                mode==1 ~ "major"),
          key_str = as.character(key),
          key_group = str_replace_all(key_str, key_map))

Correlation of audio features

Plotting the trends of audio features over the years sparked some curiosity about whether these features are interrelated and, if so, how.

The graph below shows the correlation of each audio feature with every other audio feature. Note that mode and key are excluded from this matrix as they are more like categorical variables. Colors closer to orange indicate higher positive correlation, colors closer to purple indicate near-zero correlation, while colors closer to dark blue indicate negative correlation.

It is trivial to point out that each audio feature has a perfect positive correlation with itself. Energy is negatively correlated with acousticness, which aligns with what we observed earlier in the trends over the years.

Popularity is negatively correlated with acousticness and positively correlated with energy.

Loudness is highly correlated with energy, which makes loudness negatively correlated with acousticness too.

Danceability is positively correlated with valence of the track, not surprising!

ggplotly(
  # audio_features_2: character vector of audio feature column names,
  # defined earlier in the post
  melt(cor(data_prc %>%
      select(all_of(c(audio_features_2, "popularity"))))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "#003366", high = "orange", mid = "purple") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "", y = "")
)

Each of the above correlation observations is visualized in the chart below with fitted linear trends. When we want to predict the value of an audio feature from the rest of the audio features, such correlation plots can be very useful for understanding the dependencies.

p_acoustic <- data_prc %>%
  mutate(acoustic_bin = cut(acousticness, seq(0,1.1,0.0001), right = FALSE)) %>%
  group_by(acoustic_bin) %>%
  summarise(mean_popularity = mean(popularity),
         mean_acoustic = mean(acousticness)) %>%
  ungroup() %>%
  ggplot(aes(x = mean_acoustic, y = mean_popularity)) +
  geom_point(alpha = 0.2, size = 3) +
  geom_smooth(method = "lm") +
  labs(title = str_glue("correlation = {round(cor(data_prc$acousticness, data_prc$popularity),2)}"))
p_energy <- data_prc %>%
  mutate(energy_bin = cut(energy, seq(0,1.1,0.0001), right = FALSE)) %>%
  group_by(energy_bin) %>%
  summarise(mean_popularity = mean(popularity),
         mean_energy = mean(energy)) %>%
  ungroup() %>%
  ggplot(aes(x = mean_energy, y = mean_popularity)) +
  geom_point(alpha = 0.2, size = 3) +
  geom_smooth(method = "lm") +
  labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$popularity),2)}"))
p_energy_loudness <- data_prc %>%
  mutate(loudness_bin = cut(loudness, seq(0,-60,-0.01), right = FALSE)) %>%
  group_by(loudness_bin) %>%
  summarise(mean_energy = mean(energy),
         mean_loudness = mean(loudness)) %>%
  ungroup() %>%
  ggplot(aes(x = mean_loudness, y = mean_energy)) +
  geom_point(alpha = 0.2, size = 3) +
  geom_smooth(method = "lm") +
  labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$loudness),2)}"))
p_valence_dance <- data_prc %>%
  mutate(dance_bin = cut(danceability, seq(0,1,0.0001), right = FALSE)) %>%
  group_by(dance_bin) %>%
  summarise(mean_valence = mean(valence),
         mean_dance = mean(danceability)) %>%
  ungroup() %>%
  ggplot(aes(x = mean_dance, y = mean_valence)) +
  geom_point(alpha = 0.2, size = 3) +
  geom_smooth(method = "lm") +
  labs(title = str_glue("correlation = {round(cor(data_prc$valence, data_prc$danceability),2)}"))
grid.arrange(p_acoustic, p_energy, p_energy_loudness, p_valence_dance, nrow = 2)

Top 20 most productive artists

The chart below shows artists by the number of songs released on Spotify over the 100 years. The shading represents the time between the release of their first and last tracks. The top 4 most productive artists, leading by a wide margin, were active for < 30 years between 1920 and 1950. A higher number of active years may have some correlation with popularity, as such artists could consistently release tracks over many years. Ella Fitzgerald is one of the early artists (from the 1920s) whose last track was added in 1999, yet she ranks pretty high on the popularity spectrum (73). Frank Sinatra has been active for the most years among the top 20 artists.

I could not help but notice our beloved Lata Mangeshkar ji’s name sitting nicely between The Beatles and Queen.

ggplotly(
  data_prc %>%
  group_by(artists) %>%
  summarise(n_songs = n(),
            first_activity = min(year),
            last_activity = max(year)) %>%
  ungroup() %>%
  mutate(years_active = last_activity - first_activity + 1) %>%
  arrange(desc(n_songs)) %>%
  head(20) %>%
  ggplot(aes(x = reorder(artists, n_songs), y = n_songs, fill = years_active)) +
  geom_col() +
  labs(y = "artist", x = "tracks") +
  coord_flip()
)

Audio features of tracks by Lata Mangeshkar

The most popular song of Lata Mangeshkar is Aaj Phir Jeene Ki Tamanna Hai, released in 1965. The chart below shows that her tracks are on the high end of the acousticness spectrum (not surprising). Danceability is in the medium range. Her tracks on Spotify are mostly on the higher end of valence. Since Spotify might have only a limited set of tracks from this prolific artist, it might be best not to conclude anything in particular.

# data_prc %>%
#   filter(artists=="['Lata Mangeshkar']") %>%
#   arrange(desc(popularity))

ggplotly(
  data_prc %>%
  filter(artists=="['Lata Mangeshkar']") %>%
  pivot_longer(all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
  ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
  geom_jitter(size = 3, alpha = 0.4) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_color_brewer(palette = "Set3")
)

Common words in track names

The top 100 words from 100 years have love and live as the most common words across all track names, after excluding commonly used words like a/an/the/you/me/I/mixed/remaster etc. and years. I also wonder which live, the verb or the adjective, has been used most commonly in track names. That might require another blog post on sentiment analysis. For now, let’s hope it is the verb.
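A minimal sketch of how such word counts could be produced, assuming the tidytext package and its built-in stop_words list (the exact stopword and production-term lists used for the charts may differ); a tiny toy frame stands in for the real track names:

```r
library(dplyr)
library(stringr)
library(tidytext)  # assumed available; provides unnest_tokens() and stop_words

# Hypothetical track names standing in for data_prc$name
toy_names <- tibble(name = c("Love Me Do", "Live And Let Die - 2020 Remaster"))

top_words <- toy_names %>%
  unnest_tokens(word, name) %>%                 # lowercase words from track names
  anti_join(stop_words, by = "word") %>%        # drop a/an/the/me/... stopwords
  filter(!str_detect(word, "^[0-9]+$"),         # drop years and other numbers
         !word %in% c("remaster", "mixed")) %>% # drop production terms
  count(word, sort = TRUE)
print(top_words)
```

Applied to the full dataset, `head(top_words, 100)` would give the top-100 lists behind the word clouds.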

Top 100 words over 100 years


It is quite interesting to see workout-related words among the most common words in track names from the year 2020.

Top 100 words in 2020


Concluding remarks

As you might already be feeling exhausted from going through the various slices of data and extracting insights from them, such exploratory data analysis could become a never-ending task without a clear end goal in mind. It is common practice in data science to perform such explorations before developing a predictive model. Now that we have gained a decent understanding of the data and the correlations among the features, one could build a model to predict popularity based on audio features. Moreover, with additional data on genres and artists, a recommendation engine could also be developed.
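As a starting point, such a popularity model could be sketched with a plain linear regression. The toy data below is hypothetical (standing in for data_prc), and a real model would need a train/test split, diagnostics, and likely a non-linear learner:

```r
# Minimal sketch: linear model of popularity on audio features.
# Toy data simulates the kind of relationships seen in the correlation plots.
set.seed(42)
n <- 200
toy <- data.frame(acousticness = runif(n), energy = runif(n),
                  danceability = runif(n), valence = runif(n))
toy$popularity <- 50 - 20 * toy$acousticness + 25 * toy$energy +
  rnorm(n, sd = 5)

pop_model <- lm(popularity ~ acousticness + energy + danceability + valence,
                data = toy)
coef(pop_model)  # signs should mirror the correlations: acousticness -, energy +
```

Swapping the toy frame for data_prc (and the formula for the full set of audio features) would give a first baseline before reaching for anything fancier.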

Music has always been a steady companion for me, especially in the pandemic. It moves me emotionally and physically and lets me live in a single state of mind - ecstasy. I hope some of your musical and analytical curiosity has been satisfied through this blog.